A Novel Approach to Integrated Search Information Retrieval Technique for Hidden Web for Domain Specific Crawling
Abstract
Traditional web crawlers retrieve content only from the “Surface web”. They cannot reach the hidden portion of the Web, which contains high-quality information generated dynamically from databases when queries are submitted through a search interface. For the Hidden web, most published research has focused on detecting such searchable forms and searching over them systematically. One approach is based on a Web crawler that identifies and analyzes search forms and fills them with appropriate content to retrieve as much relevant information as possible from the deep web. A critical problem with this approach is integrating the search interfaces and then integrating the results returned by the parallel requests. This paper proposes a technique to detect and construct an integrated query interface that combines a set of web interfaces over a given domain of interest, giving users uniform access to information from multiple sources. The proposed strategy focuses the crawl on a given topic and analyzes domain-specific matching and mapping information and key attributes, which leads to web pages containing domain-specific search forms that can be handled in an integrated manner. The interface mapping and matching library and the domain-specific semantic knowledge base play a key role here and need continuous evolution and improvement to make hidden web crawling effective and usable.
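The idea of an integrated query interface built from attribute matching can be illustrated with a minimal sketch. This is not the paper's implementation: the synonym table standing in for the domain knowledge base, the `SearchForm` type, the field names, and the example URLs are all assumptions made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a domain-specific knowledge base (book domain):
# each unified attribute lists the local field names that may denote it.
SYNONYMS = {
    "title": {"title", "book_title", "name"},
    "author": {"author", "writer", "by"},
    "price": {"price", "cost", "max_price"},
}


@dataclass
class SearchForm:
    """A search form found by the crawler (illustrative structure)."""
    url: str
    fields: list[str]


def match_fields(form: SearchForm) -> dict[str, str]:
    """Map each unified attribute to the form's local field name, if any."""
    mapping: dict[str, str] = {}
    for field in form.fields:
        key = field.lower()
        for unified, names in SYNONYMS.items():
            if key in names and unified not in mapping:
                mapping[unified] = field
    return mapping


def integrated_interface(forms: list[SearchForm]) -> list[str]:
    """Unified attributes supported by at least one of the forms."""
    attributes: set[str] = set()
    for form in forms:
        attributes |= match_fields(form).keys()
    return sorted(attributes)


def translate_query(query: dict[str, str], form: SearchForm) -> dict[str, str]:
    """Rewrite a unified query into one form's own field names.

    Attributes the form cannot express are dropped.
    """
    mapping = match_fields(form)
    return {mapping[k]: v for k, v in query.items() if k in mapping}


forms = [
    SearchForm("https://books-a.example/search", ["book_title", "writer"]),
    SearchForm("https://books-b.example/find", ["Name", "Cost", "isbn"]),
]
unified_query = {"title": "Dune", "author": "Herbert"}
per_site = {f.url: translate_query(unified_query, f) for f in forms}
```

Here one unified query is translated into a separate parameter set for each site, which is the step that allows requests to be issued in parallel. A real system would need a much richer matching library than exact synonym lookup, and merging the returned results is left out of this sketch.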
Related resources
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading only domain-specific web pages is not a simple task, and an unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...
Public Transport Ontology for Passenger Information Retrieval
Passenger information aims at improving the user-friendliness of public transport systems while influencing passenger route choices to satisfy transit users' travel requirements. The integration of transit information from multiple agencies is a major challenge in the implementation of multi-modal passenger information systems. The problem of information sharing is further compounded by the multi-l...
Federated Search
Federated search (federated information retrieval or distributed information retrieval) is a technique for searching multiple text collections simultaneously. Queries are submitted to a subset of collections that are most likely to return relevant answers. The results returned by selected collections are integrated and merged into a single list. Federated search is preferred over centralized se...
Federated Search Milad Shokouhi
Federated search (federated information retrieval or distributed information retrieval) is a technique for searching multiple text collections simultaneously. Queries are submitted to a subset of collections that are most likely to return relevant answers. The results returned by selected collections are integrated and merged into a single list. Federated search is preferred over centralized se...
Collecte orientée sur le Web pour la recherche d'information spécialisée. (Focused document gathering on the Web for domain-specific information retrieval)
Vertical search engines, which focus on a specific segment of the Web, are becoming more and more present in the Internet landscape. Topical search engines, notably, can obtain a significant performance boost by limiting their index to a specific topic. By doing so, language ambiguities are reduced, and both the algorithm...